Scenario¶
XYZ is a private firm in US. Due to remarkable growth in the Cab Industry in last few years and multiple key players in the market, it is planning for an investment in Cab industry and as per their Go-to-Market(G2M) strategy they want to understand the market before taking final decision.
Project Delivery¶
Having been provided with multiple datasets that contain information on 2 cab companies from 31/01/2016 to 31/12/2018, XYZ is interested in using actionable insights derived from my analysis to help them identify the right company to make their investment.
My goal is to analyze the given datasets to ascertain the best investment option for XYZ.
PHASE 1: ASK¶
The guiding questions of this analysis include:
- Which company has maximum cab users at a particular time period?
- Does margin proportionally increase with increase in number of customers?
- What are the attributes of these customer segments?
- Is there any seasonality in number of customers using the cab service?
PHASE 2: PREPARE¶
Importing Needed Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
import plotly.graph_objects as go
from collections import Counter
from IPython.display import display
warnings.filterwarnings("ignore")
Reading Datasets
cab_data = pd.read_csv('Cab_Data.csv')
city_data = pd.read_csv('City.csv')
customer_data = pd.read_csv('Customer_ID.csv')
transaction_data = pd.read_csv('Transaction_ID.csv')
PHASE 3: PROCESS¶
Check how data looks like, starting with the cab_data dataset
cab_data.head()
| Transaction ID | Date of Travel | Company | City | KM Travelled | Price Charged | Cost of Trip | |
|---|---|---|---|---|---|---|---|
| 0 | 10000011 | 42377 | Pink Cab | ATLANTA GA | 30.45 | 370.95 | 313.635 |
| 1 | 10000012 | 42375 | Pink Cab | ATLANTA GA | 28.62 | 358.52 | 334.854 |
| 2 | 10000013 | 42371 | Pink Cab | ATLANTA GA | 9.04 | 125.20 | 97.632 |
| 3 | 10000014 | 42376 | Pink Cab | ATLANTA GA | 33.17 | 377.40 | 351.602 |
| 4 | 10000015 | 42372 | Pink Cab | ATLANTA GA | 8.73 | 114.62 | 97.776 |
Check the structure of the dataset
city_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 20 entries, 0 to 19 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 City 20 non-null object 1 Population 20 non-null object 2 Users 20 non-null object dtypes: object(3) memory usage: 612.0+ bytes
Check for null values in each column
cab_data.isnull().sum(axis = 0)
Transaction ID 0 Date of Travel 0 Company 0 City 0 KM Travelled 0 Price Charged 0 Cost of Trip 0 dtype: int64
Key Observations Thus Far:
- The cab_data dataset has 7 columns and 359391 rows.
- There are no null values in any of the columns.
- Values in all the columns are in the correct format, except for the "Date of Travel" column, which has its 5-digit values in int(64) type, instead of Date.
Convert "Date_of_Travel" column to Date type
We will use the given start date (31/01/2016) and end date (31/12/2018) as hint to find the dates
numeric_start_date = cab_data['Date of Travel'].min()
numeric_end_date = cab_data['Date of Travel'].max()
print("Start Date: %d\nEnd Date: %d" %(numeric_start_date,numeric_end_date))
Start Date: 42371 End Date: 43465
42371 corresponds to 31/01/2016
43465 corresponds to 31/12/2018
Since start date 42371 corresponds to 31/01/2016, we will use the following formula to calculate every other date:
Date = start_date + (Numeric_date_value - 42371) days
We will use the end_date to verify the validity of this formula.
end_date = 31/01/2016 + (43465 - 42371)days
= 31/01/2016 + 1094 days
= 01/29/2019
However, the end_date was stated to be 31/12/2018 (not 01/29/2019) in the problem statement. There seems to be a bit of innacuracy in the numeric date values, with a margin of error of about 29 days. Nonetheless, we will stick with the original start date and the new end date
Date = 31/01/2016 + (Numeric_value - 42371)days
Calculate End Date
start_date = "01/31/2016"
end_date = pd.to_datetime(start_date) + pd.DateOffset(days = numeric_end_date-numeric_start_date)
end_date
Timestamp('2019-01-29 00:00:00')
Convert all dates from numeric to date type
def get_date(numeric_date):
the_date = pd.to_datetime(start_date) + pd.DateOffset(days = numeric_date - numeric_start_date)
return the_date
NB: The next cell takes quite a bit of time to load
cab_data['Date of Travel'] = cab_data['Date of Travel'].apply(lambda numeric_dates: get_date(numeric_dates))
Preliminary data cleaning for the cab_data dataset is now done, we will now move on to the other three datasets
transaction_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 440098 entries, 0 to 440097 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Transaction ID 440098 non-null int64 1 Customer ID 440098 non-null int64 2 Payment_Mode 440098 non-null object dtypes: int64(2), object(1) memory usage: 10.1+ MB
customer_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 49171 entries, 0 to 49170 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Customer ID 49171 non-null int64 1 Gender 49171 non-null object 2 Age 49171 non-null int64 3 Income (USD/Month) 49171 non-null int64 dtypes: int64(3), object(1) memory usage: 1.5+ MB
city_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 20 entries, 0 to 19 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 City 20 non-null object 1 Population 20 non-null object 2 Users 20 non-null object dtypes: object(3) memory usage: 612.0+ bytes
Convert the Population and Users columns to int type in the city_data dataset
city_data['Users'] = city_data['Users'].str.replace(',','').astype(int)
city_data['Population'] = city_data['Population'].str.replace(',','').astype(int)
Merge datasets to create master dataset
master_data = cab_data.merge(transaction_data, on="Transaction ID", how = 'inner')
master_data = master_data.merge(customer_data, on='Customer ID', how='inner')
master_data = master_data.merge(city_data, on="City", how="inner")
Get years, months and days out of Date of Travel and get profits out of Price Charged and Cost of Trip
master_data['Years'] = master_data['Date of Travel'].dt.year
master_data['Months'] = master_data['Date of Travel'].dt.month_name()
master_data['Days'] = master_data['Date of Travel'].dt.day_name()
master_data['Profits'] = master_data['Price Charged']-master_data['Cost of Trip']
Reorder columns in master dataset
master_data = master_data.reindex(columns=['Transaction ID','Date of Travel','Years','Months','Days',
'Customer ID', 'Company', 'City', 'Population','Users',
'KM Travelled', 'Price Charged', 'Cost of Trip','Profits',
'Payment_Mode', 'Gender','Age', 'Income (USD/Month)'])
Rename columns in master dataset
master_data.columns = ['transaction_id','date_of_travel','years','months','days','customer_id', 'company', 'city',
'population','users','km_travelled', 'price_charged', 'cost_of_trip','profits','payment_mode',
'gender','age', 'income_usd_per_month']
master_data.head()
| transaction_id | date_of_travel | years | months | days | customer_id | company | city | population | users | km_travelled | price_charged | cost_of_trip | profits | payment_mode | gender | age | income_usd_per_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000011 | 2016-02-06 | 2016 | February | Saturday | 29290 | Pink Cab | ATLANTA GA | 814885 | 24701 | 30.45 | 370.95 | 313.6350 | 57.3150 | Card | Male | 28 | 10813 |
| 1 | 10351127 | 2018-08-19 | 2018 | August | Sunday | 29290 | Yellow Cab | ATLANTA GA | 814885 | 24701 | 26.19 | 598.70 | 317.4228 | 281.2772 | Cash | Male | 28 | 10813 |
| 2 | 10412921 | 2018-12-22 | 2018 | December | Saturday | 29290 | Yellow Cab | ATLANTA GA | 814885 | 24701 | 42.55 | 792.05 | 597.4020 | 194.6480 | Card | Male | 28 | 10813 |
| 3 | 10000012 | 2016-02-04 | 2016 | February | Thursday | 27703 | Pink Cab | ATLANTA GA | 814885 | 24701 | 28.62 | 358.52 | 334.8540 | 23.6660 | Card | Male | 27 | 9237 |
| 4 | 10320494 | 2018-05-20 | 2018 | May | Sunday | 27703 | Yellow Cab | ATLANTA GA | 814885 | 24701 | 36.38 | 721.10 | 467.1192 | 253.9808 | Card | Male | 27 | 9237 |
PHASE 4: ANALYZE¶
yellow_cab=len(master_data.groupby('company').get_group('Yellow Cab'))/len(master_data['company'])
pink_cab=len(master_data.groupby('company').get_group('Pink Cab'))/len(master_data['company'])
Check each company's percentage share of the total cab transactionss
values = np.array([yellow_cab,pink_cab])
mlabels = ['Yellow Cab','Pink Cab']
plt.pie(values, labels=mlabels, autopct='%1.2f%%', shadow=True, explode = [0,0.08],startangle=90)
plt.tight_layout()
plt.title("Cab Company Share of All Journeys", bbox={'facecolor':'gray', 'pad':8}, color='white')
plt.xlabel('''\n 2016-01-31 to 2019-01-29 ''')
Text(0.5, 47.24444444444444, '\n 2016-01-31 to 2019-01-29 ')
Key Note:
- The Yellow Cab is used a lot more by customers than the Pink Cab
plt.figure(figsize = (16, 9))
order=["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]
sns.countplot(x="days", hue="company", order=order, data=master_data,palette="Set2",edgecolor = "black")
plt.title('Daily Cab Usage', fontsize=25)
plt.ylabel('Frequency of Usage', fontsize=12)
plt.xlabel('Day', fontsize=12)
plt.show()
Check each company's total cab transactions each month
pinkprofit=master_data[master_data.company=='Pink Cab'].groupby('months').company.count()
yellowprofit=master_data[master_data.company=='Yellow Cab'].groupby('months').company.count()
print("Pink Cab Profit:->\n", pinkprofit)
print("\nYellow Cab Profit:->\n", yellowprofit)
Pink Cab Profit:-> months April 4097 August 6455 December 11570 February 4483 January 11298 July 6099 June 4945 March 3621 May 4218 November 10069 October 9944 September 7912 Name: company, dtype: int64 Yellow Cab Profit:-> months April 16909 August 20671 December 33307 February 16276 January 34770 July 19534 June 18430 March 16052 May 17914 November 28737 October 28228 September 23853 Name: company, dtype: int64
plt.figure(figsize = (16, 9))
order=["January","February","March","April","May","June","July","August","September","October","November","December"]
sns.countplot(x="months", hue="company", data=master_data, order=order,palette="Set1",edgecolor = "black")
plt.title('Monthly Cab Usage', fontsize=25)
plt.ylabel('Frequency of Usage', fontsize=12)
plt.xlabel('Month', fontsize=12)
plt.show()
Key Note:
- There is considerable seasonality in Cab Usage
- Cab usage rises steadily from February to January of the next year
Check each company's total transcations per year
plt.figure(figsize = (10, 7))
ax = sns.countplot(x="years", hue="company", data=master_data)
plt.title('Yearly Cab Usage', fontsize=25)
plt.ylabel('Frequency of Patronage')
plt.show()
We will now check each company's average profits per year and the difference between the two
yellowcab_profits = master_data[master_data['company']=='Yellow Cab'].groupby(master_data['years']).profits.mean()
pinkcab_profits = master_data[master_data['company']=='Pink Cab'].groupby(master_data['years']).profits.mean()
diff=yellowcab_profits-pinkcab_profits
fig = go.Figure(data=[
go.Bar(
x=['2016','2017','2018','2019'],
y=yellowcab_profits.values,
name='Yellow Cab',
marker_color='yellow'),
go.Bar(
x=['2016','2017','2018','2019'],
y=pinkcab_profits.values,
name='Pink Cab',
marker_color='#FF1493')
])
fig.add_trace(go.Bar(
x=['2016','2017','2018','2019'],
y=diff.values,
name='Difference',
marker_color='burlywood'
))
fig.update_layout(barmode='group', xaxis_tickangle=-45,plot_bgcolor='grey',title = 'Average Profits Per Year', title_x=0.5,
yaxis_title="Average Profits Gained", xaxis_title="Year",)
display(fig)
difff = round(diff.mean(),2)
print("Average Difference: "+str(difff))
Average Difference: 89.86
Key Notes:
- The Yellow Cab has its average profits being significantly higher than the Pink Cab's
- From 2016 to 2019, there is an average profit
yellowcab_profits = master_data[master_data['company']=='Yellow Cab'].groupby(master_data['years']).profits.mean()
pinkcab_profits = master_data[master_data['company']=='Pink Cab'].groupby(master_data['years']).profits.mean()
plt.figure(figsize = (5, 3))
ax=plt.axes()
ax.set(facecolor = "grey")
plt.plot(['2016','2017','2018','2019'], pinkcab_profits.values, color='deeppink')
plt.plot(['2016','2017','2018','2019'], yellowcab_profits.values, color='yellow')
plt.title('Yearly Average Profit Trend', fontsize=15)
plt.ylabel('Average Profit',fontsize=10)
plt.xlabel('Year',fontsize=10)
plt.legend(['Pink Cab','Yellow cab'], facecolor="purple")
plt.figure(facecolor='yellow')
plt.show()
<Figure size 640x480 with 0 Axes>
Key Note:
- Despite the Yellow Cab having much greater profits each year, its profits have been declining consistently from 2017 to 2019, whereas the Pink Cab's profits saw a surge from 2018 to 2019.
We will now check the cities where these companies have most of their users
cities = master_data.groupby('city').users.count()
index = cities.index
values = cities.values
fig = go.Figure(data=[go.Pie(labels=index, values=values,name="Cities")])
fig.update_traces( textinfo='percent+label',textfont_size=10, hoverinfo="label+value+name",
marker=dict(line=dict(color='grey', width=2)))
fig.update_layout(title_x=0.45, title_text="Cab Companies' Cities of Operation",
annotations=[dict(text='', x=0.5, y=0.5, font_size=20, showarrow=False)],autosize=False,
width=900,
height=800, showlegend=False)
display(fig)
Key Note:
- New York, Chicago, Los Angeles, Washington, DC and Boston are the top 5 cities of operation for these two cab companies, with respect to number of users.
yellow_profits_per_city = master_data[master_data['company']=='Yellow Cab'].groupby(master_data['city']).profits.mean()
pink_profits_per_city = master_data[master_data['company']=='Pink Cab'].groupby(master_data['city']).profits.mean()
diff=yellowcab_profits-pinkcab_profits
fig = go.Figure()
fig.add_trace(go.Bar(
x=yellow_profits_per_city.index,
y=yellow_profits_per_city.values,
name='Yellow Cab',
marker_color='yellow'
))
fig.add_trace(go.Bar(
x=yellow_profits_per_city.index,
y=pink_profits_per_city.values,
name='Pink Cab',
marker_color='#FF1493'
))
fig.update_layout(barmode='group', xaxis_tickangle=-45,plot_bgcolor='grey',title = 'Average Profits Per City', title_x=0.5,
yaxis_title="Average Profits", xaxis_title="City")
display(fig)
Key Note:
- The cities with most users are not necessarily the cities where the companies make the most average profit
- New York, Silicon Valley and Dallas yield the highest average profits for the Yellow Cab company and Dallas is replaced by Miami for the Pink Cab company
We will now analyze each company's customer retention
pink_cab = master_data[master_data['company']=='Pink Cab']
pink_cab_customers = pink_cab['customer_id']
yellow_cab = master_data[master_data['company']=='Yellow Cab']
yellow_cab_customers = yellow_cab['customer_id']
pink_val_counts = Counter(pink_cab_customers)
pink_val_and_counts_5 = {value: count for value, count in pink_val_counts.items() if count > 5}
pink_val_and_counts_5_perc = round(len(pink_val_and_counts_5)/len(pink_cab_customers)*100,2)
pink_val_and_counts_10 = {value: count for value, count in pink_val_counts.items() if count > 10}
pink_val_and_counts_10_perc = round(len(pink_val_and_counts_10)/len(pink_cab_customers)*100,2)
pink_val_and_counts_15 = {value: count for value, count in pink_val_counts.items() if count > 15}
pink_val_and_counts_15_perc = round(len(pink_val_and_counts_15)/len(pink_cab_customers)*100,2)
yellow_val_counts = Counter(yellow_cab_customers)
yellow_val_and_counts_5 = {value: count for value, count in yellow_val_counts.items() if count > 5}
yellow_val_and_counts_5_perc = round(len(yellow_val_and_counts_5)/len(yellow_cab_customers)*100,2)
yellow_val_and_counts_10 = {value: count for value, count in yellow_val_counts.items() if count > 10}
yellow_val_and_counts_10_perc = round(len(yellow_val_and_counts_10)/len(yellow_cab_customers)*100,2)
yellow_val_and_counts_15 = {value: count for value, count in yellow_val_counts.items() if count > 15}
yellow_val_and_counts_15_perc = round(len(yellow_val_and_counts_15)/len(yellow_cab_customers)*100,2)
tb = pd.DataFrame({
'Cab Company': ['Pink', 'Yellow'],
'5+ Transaction Customers': [len(pink_val_and_counts_5), len(yellow_val_and_counts_5)],
'5+ Retention Rate': [pink_val_and_counts_5_perc, yellow_val_and_counts_5_perc],
'10+ Transaction Customers': [len(pink_val_and_counts_10), len(yellow_val_and_counts_10)],
'10+ Retention Rate': [pink_val_and_counts_10_perc, yellow_val_and_counts_10_perc],
'15+ Transaction Customers': [len(pink_val_and_counts_15), len(yellow_val_and_counts_15)],
'15+ Retention Rate': [pink_val_and_counts_15_perc, yellow_val_and_counts_15_perc],
})
fig = go.Figure(data=[go.Table(
header=dict(values=list(tb.columns),
fill_color='paleturquoise',
align='left'),
cells=dict(values=[tb['Cab Company'],tb['5+ Transaction Customers'],tb['5+ Retention Rate'],
tb['10+ Transaction Customers'], tb['10+ Retention Rate'],
tb['15+ Transaction Customers'], tb['15+ Retention Rate']],
fill_color='lavender',
align='left'))])
display(fig)
Key Notes:
- 5.33% of the Yellow Cab company's customers returned at least 5 times, compared to 4.36% for the Pink Cab
- We notice a more significant difference when checking the percentage of customers who returned at least 10 times - 3.53% for the Yellow Cab company and only 0.28% for the Pink Cab company.
- At 15+ retention also, the Yellow Cab leads the way again, with 1.99% retention, compared to only 0.01% for the Pink Cab
- We can confidently conclude that the Yellow Cab has better customer retention
We will now find the prevalent payment modes in each company's transactions
cash = master_data[master_data['payment_mode']=='Cash']
pinkcab_cash= round((len(cash[master_data['company']=='Pink Cab'])/len(master_data))*100,2)
yellowcab_cash= round((len(cash[master_data['company']=='Yellow Cab'])/len(master_data))*100,2)
card = master_data[master_data['payment_mode']=='Card']
pinkcab_card= round((len(card[master_data['company']=='Pink Cab'])/len(master_data))*100,2)
yellowcab_card= round((len(card[master_data['company']=='Yellow Cab'])/len(master_data))*100,2)
mode_value=[yellowcab_card, pinkcab_card, yellowcab_cash, pinkcab_cash]
mode_index =['Yellow Cab Card','Pink Cab Card','Yellow Cab Cash','Pink Cab Cash']
colors=['#FFFF00','#FF00FF','#DAA520','#9F2B68']
fig = go.Figure(data=[go.Pie(labels=mode_index, values=mode_value,name="Payment Modes")])
# Use `hole` to create a donut-like pie chart
fig.update_traces(textfont_size=20, hole=.4, hoverinfo="label+percent+name",marker=dict(colors=colors, line=dict(color='grey', width=2)))
fig.update_layout(title_x=0.45, title_text="Payment Modes for Transactions",
annotations=[dict(text='', x=0.5, y=0.5, font_size=20, showarrow=False)])
display(fig)
Key Notes:
- Cards are the preferred payment mode for customers
- Cards are used for payment 60% of the time, cash 40% - this is true for both Cab Companies.
We will now analyze each cab company's presence in various cities
city_data['percent'] = round((city_data['Users']/city_data['Population'])*100,2)
fig = px.bar(city_data, x = 'City', y = 'percent', color = 'City', title = "Percentage of City Population Using The Two Cabs Companies' Services")
fig.update_layout(xaxis_tickangle=-45,plot_bgcolor='grey', title_x=0.4, yaxis_title="User Percentage", xaxis_title="City")
display(fig)
Key Notes:
- The two Cab companies have over 30% of the entire population of Washington DC, San Francisco and Boston using their services.
- However, there is a lot of room for growth, especially considering the fact that they have less than 10% of the people in each of the other cities using their services
cities = master_data.groupby('city').profits.mean()
cities
fig = px.scatter(master_data, x='km_travelled', y='price_charged', trendline='ols',color='company', symbol='company')
display(fig)